Rethinking matching-based few-shot action recognition
Few-shot action recognition, i.e., recognizing new action classes given only a
few examples, benefits from incorporating temporal information. Prior work
either encodes such information in the representation itself and learns
classifiers at test time, or obtains frame-level features and performs pairwise
temporal matching. We first evaluate a number of matching-based approaches
using features from spatio-temporal backbones, a comparison missing from the
literature, and show that the gap in performance between simple baselines and
more complicated methods is significantly reduced. Inspired by this, we propose
Chamfer++, a non-temporal matching function that achieves state-of-the-art
results in few-shot action recognition. We show that, when starting from
temporal features, our parameter-free and interpretable approach can outperform
all other matching-based and classifier methods for one-shot action recognition
on three common datasets without using temporal information in the matching
stage. Project page: https://jbertrand89.github.io/matching-based-fsar
Comment: Accepted at SCIA 2023
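To make the matching stage concrete, below is a minimal sketch of Chamfer-style, non-temporal matching between frame-level features, written in PyTorch. The function name, the use of cosine similarity, and the symmetric averaging over both directions are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: Chamfer-style matching between frame-level features of a
# query video and a support video. Frame order is irrelevant, so no temporal
# information is used in the matching stage.
import torch
import torch.nn.functional as F

def chamfer_score(query_feats: torch.Tensor, support_feats: torch.Tensor) -> torch.Tensor:
    """query_feats: (Tq, D), support_feats: (Ts, D) frame-level features."""
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    sim = q @ s.T  # (Tq, Ts) pairwise cosine similarities
    # Each query frame matches its best support frame (and vice versa);
    # averaging both directions gives a symmetric, parameter-free score.
    q_to_s = sim.max(dim=1).values.mean()
    s_to_q = sim.max(dim=0).values.mean()
    return 0.5 * (q_to_s + s_to_q)
```

In a one-shot episode, the query would then simply be assigned the class of the support video with the highest score.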
Fake it till you make it: Learning transferable representations from synthetic ImageNet clones
Recent image generation models such as Stable Diffusion have exhibited an
impressive ability to generate fairly realistic images starting from a simple
text prompt. Could such models render real images obsolete for training image
prediction models? In this paper, we answer part of this provocative question
by investigating the need for real images when training models for ImageNet
classification. Provided only with the class names that have been used to build
the dataset, we explore the ability of Stable Diffusion to generate synthetic
clones of ImageNet and measure how useful these are for training classification
models from scratch. We show that with minimal and class-agnostic prompt
engineering, ImageNet clones are able to close a large part of the gap between
models trained on synthetic images and models trained on real images, across
the several standard classification benchmarks that we consider in this study.
More importantly, we show that models trained on synthetic images exhibit
strong generalization properties and perform on par with models trained on real
data for transfer. Project page: https://europe.naverlabs.com/imagenet-sd/
Comment: Accepted to CVPR 2023
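As a rough illustration of the pipeline, the sketch below generates class images from class names alone using Stable Diffusion via the Hugging Face diffusers library. The prompt template, model checkpoint, and per-class sample count are assumptions for illustration, not the paper's recipe.

```python
# Illustrative sketch: building a synthetic "ImageNet clone" from class names
# only, using Stable Diffusion. Prompt template and counts are assumptions.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_names = ["tench", "goldfish", "great white shark"]  # ...all 1000 classes
for name in class_names:
    out_dir = os.path.join("imagenet_clone", name.replace(" ", "_"))
    os.makedirs(out_dir, exist_ok=True)
    prompt = f"a photo of a {name}"  # minimal, class-agnostic prompt
    for i in range(4):  # a handful of samples per class for the sketch
        image = pipe(prompt).images[0]
        image.save(os.path.join(out_dir, f"{i}.png"))
```

The resulting directory tree mirrors an ImageNet-style layout, so a standard classifier can be trained on it from scratch with an unmodified data loader.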
Concept Generalization in Visual Representation Learning
Measuring concept generalization, i.e., the extent to which models trained on
a set of (seen) visual concepts can be used to recognize a new set of (unseen)
concepts, is a popular way of evaluating visual representations, especially
when they are learned with self-supervised learning. Nonetheless, the choice of
which unseen concepts to use is usually made arbitrarily, and independently
of the seen concepts used to train representations, thus ignoring any
semantic relationships between the two. In this paper, we argue that semantic
relationships between seen and unseen concepts affect generalization
performance and propose ImageNet-CoG, a novel benchmark on the ImageNet dataset
that enables measuring concept generalization in a principled way. Our
benchmark leverages expert knowledge from WordNet to define a sequence of
unseen ImageNet concept sets that are semantically increasingly distant from
the ImageNet-1K subset, a ubiquitous training set. This allows us
to benchmark visual representations learned on ImageNet-1K out of the box: we
analyse a number of such models from supervised, semi-supervised and
self-supervised approaches under the prism of concept generalization, and show
how our benchmark is able to uncover a number of interesting insights. We will
provide resources for the benchmark at
https://europe.naverlabs.com/cog-benchmark
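For intuition, the sketch below orders candidate unseen concepts by their WordNet similarity to the seen (ImageNet-1K) concepts. The choice of path similarity and max-aggregation is an assumption made for illustration; the benchmark's exact construction may differ.

```python
# Illustrative sketch: ranking candidate unseen concepts by their WordNet
# similarity to the seen (ImageNet-1K) concepts. Lower similarity means a
# semantically more distant, harder concept set.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

seen = [wn.synset("tench.n.01"), wn.synset("goldfish.n.01")]  # IN-1K synsets

def similarity_to_seen(candidate):
    # Similarity to the closest seen concept; candidates with low values
    # would land in a later, more distant concept set.
    return max(candidate.path_similarity(s) or 0.0 for s in seen)

candidates = [wn.synset("guppy.n.01"), wn.synset("screwdriver.n.01")]
for c in sorted(candidates, key=similarity_to_seen, reverse=True):
    print(c.name(), round(similarity_to_seen(c), 3))
```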
No Reason for No Supervision: Improved Generalization in Supervised Models
We consider the problem of training a deep neural network on a given
classification task, e.g., ImageNet-1K (IN1K), so that it excels at both the
training task as well as at other (future) transfer tasks. These two seemingly
contradictory properties impose a trade-off between improving the model's
generalization and maintaining its performance on the original task. Models
trained with self-supervised learning tend to generalize better than their
supervised counterparts for transfer learning; yet, they still lag behind
supervised models on IN1K. In this paper, we propose a supervised learning
setup that leverages the best of both worlds. We extensively analyze supervised
training using multi-scale crops for data augmentation and an expendable
projector head, and reveal that the design of the projector allows us to
control the trade-off between performance on the training task and
transferability. We further replace the last layer of class weights with class
prototypes computed on the fly using a memory bank and derive two models: t-ReX
that achieves a new state of the art for transfer learning and outperforms top
methods such as DINO and PAWS on IN1K, and t-ReX* that matches the highly
optimized RSB-A1 model on IN1K while performing better on transfer tasks. Code
and pretrained models: https://europe.naverlabs.com/t-rex
Comment: Accepted to ICLR 2023 (spotlight)
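For intuition, here is a minimal sketch of the last-layer replacement described above: logits computed against class prototypes built on the fly from a memory bank, instead of a learned weight matrix. The bank size, temperature, and sum-then-normalize prototype rule are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: classification logits from online class prototypes computed
# over a memory bank, replacing the usual learned class-weight layer.
import torch
import torch.nn.functional as F

num_classes, dim, bank_size = 1000, 256, 8192
bank_feats = F.normalize(torch.randn(bank_size, dim), dim=-1)  # stored features
bank_labels = torch.randint(0, num_classes, (bank_size,))      # their labels

def prototype_logits(z: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z: (B, D) projector outputs; returns (B, num_classes) logits."""
    protos = torch.zeros(num_classes, dim)
    protos.index_add_(0, bank_labels, bank_feats)  # sum bank features per class
    protos = F.normalize(protos, dim=-1)           # unit-norm class prototypes
    return F.normalize(z, dim=-1) @ protos.T / temperature

logits = prototype_logits(torch.randn(8, dim))  # e.g., feed to cross-entropy
```

In a real training loop the bank would be refreshed with current-batch features and labels each step, so the prototypes track the evolving representation.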